Everything is defined inside setup.R.
In statistical analysis, understanding the patterns of missing data is crucial for selecting appropriate handling methods. The primary patterns are:
Univariate Missing Data: Occurs when only a single variable has missing values. This pattern is uncommon across most disciplines and typically arises in experimental studies.
Monotone Missing Data: Characterized by a situation where the missingness of one variable implies the missingness of all subsequent variables. This pattern is often observed in longitudinal studies where participants drop out and do not return. The monotone pattern is generally easier to manage, as the missingness follows a clear, observable sequence.
Non-Monotone Missing Data: Occurs when the missingness of one variable does not predict the missingness of other variables. This pattern is more complex and requires careful analysis to handle appropriately.
Another way to classify missing values is based on their underlying mechanism.
Rubin’s missing data theory introduces a dataset \(Y\) that is partitioned into observed values (\(Y_o\)) and missing values (\(Y_m\)). The presence or absence of data is tracked by an indicator matrix \(R\) defined for each element of \(Y\) as:
\[ R = \begin{cases} 0 & \text{if } Y \text{ is observed} \\ 1 & \text{if } Y \text{ is missing} \end{cases} \]
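As a concrete illustration (a minimal sketch, not part of the study's setup.R), the indicator matrix can be computed directly from a data frame's missingness pattern:

```r
# Toy dataset with two missing entries
Y <- data.frame(a = c(1, NA, 3), b = c(NA, 5, 6))

# Indicator matrix R: 0 where Y is observed, 1 where Y is missing
R_mat <- ifelse(is.na(Y), 1, 0)
R_mat
```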
Missing Completely at Random (MCAR) defines a mechanism where the probability of missingness \(P(R|q)\) has no relationship with any values in the dataset. Implementation can be univariate, where missing values are inserted through random selection or Bernoulli trials with probability \(p\), or multivariate, where missing values are distributed either uniformly or randomly across the entire dataset.
Missing At Random (MAR/CAR) occurs when missingness probability \(P(R|Y_o, q)\) depends on observed data patterns but not on missing values. For univariate cases, missing values are determined by the patterns of observed data, such as rank-based probabilities or percentile thresholds. In multivariate cases, missingness arises due to relationships or correlations between feature pairs.
Missing Not at Random (MNAR): Missingness depends on both the observed and the unobserved data, \(P(R|Y_o, Y_m, q)\), making it difficult to model and handle. Identifying and managing MNAR data requires strong assumptions about the relationship between missing and observed values.
This study focuses on Missing at Random (MAR) data due to its practical relevance and analytic tractability compared to Missing Completely at Random (MCAR) and Missing Not at Random (MNAR).
We exclude MCAR data because its randomness preserves the dataset’s underlying distribution, making missingness equivalent to a reduced sample size without introducing bias. Handling MNAR data is beyond this study’s scope because identifying and modeling its dependence on unobserved variables is often ambiguous and requires unverifiable assumptions.
Let’s create a simple dataset to demonstrate the different mechanisms of missing values and explore various ways to visualize them.
synthetic_data <- synthetic_dataset_gen(n_samples = 500,
n_covariates = 5,
correlation = "none",
target_type = "linear",
noise_level = 0.5)

The delete_MAR_censoring function introduces missing values in a dataset based on a MAR mechanism: missingness depends on the values of a control column (cols_ctrl).

Censoring mechanism:
data.MAR.uni <- delete_MAR_censoring(synthetic_data, p = 0.1, cols_mis = c('X1'), cols_ctrl = c('X2'))
# Create a new dataset indicating missing status
intersection_dataset <- synthetic_data
intersection_dataset$missing <- ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
plot_missing_data(data.MAR.uni, "MAR uni", c("white", red_nord))
# Calculate global ranges for X1 and all Y variables (X2-X5)
x_range <- range(synthetic_data$X1, na.rm = TRUE)
y_range <- range(synthetic_data[,paste0("X", 2:5)], na.rm = TRUE)
# Add padding to ranges
x_pad <- diff(x_range) * 0.05
y_pad <- diff(y_range) * 0.05
# Plot for X1 vs X2
plot_data2 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X2,
missing = ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
)
scatter_plot2 <- ggplot(plot_data2, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X2") +
theme(legend.position = "none") +
coord_fixed()
final_plot2 <- ggMarginal(scatter_plot2, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X3
plot_data3 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X3,
missing = ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
)
scatter_plot3 <- ggplot(plot_data3, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X3") +
theme(legend.position = "none") +
coord_fixed()
final_plot3 <- ggMarginal(scatter_plot3, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X4
plot_data4 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X4,
missing = ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
)
scatter_plot4 <- ggplot(plot_data4, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X4") +
theme(legend.position = "none") +
coord_fixed()
final_plot4 <- ggMarginal(scatter_plot4, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X5
plot_data5 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X5,
missing = ifelse(is.na(data.MAR.uni$X1), "Missing", "Present")
)
scatter_plot5 <- ggplot(plot_data5, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X5") +
theme(legend.position = "none") +
coord_fixed()
final_plot5 <- ggMarginal(scatter_plot5, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Arrange the plots in a 2x2 grid
grid.arrange(final_plot2, final_plot3, final_plot4, final_plot5,
ncol = 2,
top = textGrob("MAR censoring method visualisation", gp = gpar(fontsize = 16, font = 2)))

The delete_MAR_1_to_x() function in R creates MAR values by controlling missingness in one column (cols_mis) based on the values of another column (cols_ctrl). It splits the data into two groups using a cutoff value (e.g., the median) of cols_ctrl: group 1 contains rows below the cutoff, and group 2 contains the remaining rows. The x parameter sets the odds ratio of missing data: for x = 3, group 2 is 3 times more likely to have missing values than group 1.
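The per-group probabilities implied by p and x can be sketched as follows (this is the arithmetic for a median cutoff with two equally sized groups, not the package's internal implementation):

```r
# With equally sized groups, the overall rate p and odds ratio x imply:
#   (p1 + p2) / 2 = p   and   p2 = x * p1
p <- 0.2
x <- 3
p1 <- 2 * p / (1 + x)  # missingness probability in group 1 (below the cutoff)
p2 <- x * p1           # group 2 is x times more likely to be missing
c(p1 = p1, p2 = p2, overall = (p1 + p2) / 2)
```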
data.MAR.multi <- delete_MAR_1_to_x(synthetic_data,
p = 0.2,
cols_mis = c('X1'),
cols_ctrl = c('X3'),
x = 50)
data.MAR.multi <- delete_MAR_1_to_x(data.MAR.multi,
p = 0.2,
cols_mis = c('X1'),
cols_ctrl = c('X4'),
x = 50)
data.MAR.multi <- delete_MAR_1_to_x(data.MAR.multi,
p = 0.2,
cols_mis = c('X2'),
cols_ctrl = c('X5'),
x = 50)
plot_missing_data(data.MAR.multi, "MAR multi", c("white", red_nord))
# Calculate global ranges for X1 and all Y variables (X2-X5)
x_range <- range(synthetic_data$X1, na.rm = TRUE)
y_range <- range(synthetic_data[,paste0("X", 2:5)], na.rm = TRUE)
# Add padding to ranges
x_pad <- diff(x_range) * 0.05
y_pad <- diff(y_range) * 0.05
# Plot for X1 vs X2
plot_data2 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X2,
missing = ifelse(is.na(data.MAR.multi$X1), "Missing", "Present")
)
scatter_plot2 <- ggplot(plot_data2, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X2") +
theme(legend.position = "none") +
coord_fixed()
final_plot2 <- ggMarginal(scatter_plot2, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X3
plot_data3 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X3,
missing = ifelse(is.na(data.MAR.multi$X1), "Missing", "Present")
)
scatter_plot3 <- ggplot(plot_data3, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X3") +
theme(legend.position = "none") +
coord_fixed()
final_plot3 <- ggMarginal(scatter_plot3, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X4
plot_data4 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X4,
missing = ifelse(is.na(data.MAR.multi$X1), "Missing", "Present")
)
scatter_plot4 <- ggplot(plot_data4, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X4") +
theme(legend.position = "none") +
coord_fixed()
final_plot4 <- ggMarginal(scatter_plot4, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Plot for X1 vs X5
plot_data5 <- data.frame(
X1 = synthetic_data$X1,
Y = synthetic_data$X5,
missing = ifelse(is.na(data.MAR.multi$X1), "Missing", "Present")
)
scatter_plot5 <- ggplot(plot_data5, aes(x = X1, y = Y, color = missing)) +
geom_point() +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) +
scale_x_continuous(limits = c(x_range[1] - x_pad, x_range[2] + x_pad)) +
scale_y_continuous(limits = c(y_range[1] - y_pad, y_range[2] + y_pad)) +
theme_minimal() +
labs(color = "Data Status", x = "X1", y = "X5") +
theme(legend.position = "none") +
coord_fixed()
final_plot5 <- ggMarginal(scatter_plot5, margins = "y",
groupColour = TRUE, groupFill = TRUE,
type = "boxplot", size = 10, width = 0.15, outlier.size = 1)
# Arrange the plots in a 2x2 grid
grid.arrange(final_plot2, final_plot3, final_plot4, final_plot5,
ncol = 2,
top = textGrob("MAR likelihood method visualisation", gp = gpar(fontsize = 16, font = 2)))

This section reviews imputation techniques for handling missing data, from simple deletion methods to advanced machine learning approaches, highlighting their uses and impact on data reliability.
| Imputation Technique | Description | Strengths | Limitations |
|---|---|---|---|
| List-wise Deletion | Removes cases with missing values. | Simple; works for MCAR data. | Causes data loss and bias if MCAR is violated. |
| Pairwise Deletion | Removes data points only when necessary. | Retains more data; less bias for MCAR/MAR. | Inconsistent sample sizes; potential bias. |
| Simple Imputation | Replaces missing values with mean, median, or mode. | Easy and efficient. | Can distort data and introduce bias. |
| Regression Imputation | Predicts missing values using regression. | Preserves sample size; good for MAR. | Assumes linearity; less effective for non-linear data. |
| Hot-deck Imputation | Uses similar cases to fill missing values. | Avoids model assumptions. | Less reliable for small datasets. |
| Expectation-Maximization (EM) | Iteratively estimates missing values. | Accurate imputations; handles MAR well. | Computationally intensive. |
| Multiple Imputation | Generates multiple datasets to reflect uncertainty. | Reduces bias; more robust results. | Time-consuming; complex to combine results. |
| GAM Imputation | Uses Generalized Additive Models for non-linear data. | Flexible for non-linear data. | More complex and computationally expensive. |
| Decision Tree / Random Forest | Builds models using decision trees. | Captures complex patterns; handles various data types. | Computationally heavy; prone to overfitting. |
| K-Nearest Neighbors (KNN) | Imputes based on nearest neighbors. | Flexible and captures patterns. | High computational cost for large datasets. |
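To make the simplest entry in the table concrete, here is a minimal column-wise mean-imputation sketch (a standalone illustration; the simple_imputation helper used later is assumed to be defined in setup.R):

```r
# Minimal mean imputation: replace NAs in each numeric column
# with that column's observed mean
mean_impute <- function(df) {
  for (col in names(df)) {
    if (is.numeric(df[[col]])) {
      df[[col]][is.na(df[[col]])] <- mean(df[[col]], na.rm = TRUE)
    }
  }
  df
}

d <- data.frame(x = c(1, NA, 3), y = c(10, 20, NA))
mean_impute(d)  # NA in x becomes 2, NA in y becomes 15
```

Note how this shrinks the variance of each imputed column, which is exactly the distortion listed as a limitation above.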
Since we possess the original dataset prior to the introduction of missing values, we introduce two metrics to compare the distributions of two datasets: the original dataset without missing values and the dataset in which missing values have been imputed using various techniques. We care more about comparing the distributions of the two datasets than about pointwise reconstruction of the original values.
The Jensen–Shannon Divergence (JSD) quantifies the similarity or divergence between two probability distributions \(P\) and \(Q\). It is based on the midpoint distribution \(M = \frac{1}{2}(P + Q)\) and is calculated as the average of the Kullback-Leibler (KL) divergences of \(P\) and \(Q\) relative to \(M\). JSD is symmetric and always non-negative, with values ranging between \(0\) (identical distributions) and \(\log(2)\) (distributions with disjoint supports).
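A minimal sketch of JSD for discrete distributions (using natural logarithms, so the maximum is \(\log(2)\); this is illustrative and not necessarily the exact estimator used by the metric functions in setup.R):

```r
# Jensen-Shannon divergence between two discrete distributions
jsd <- function(p, q) {
  p <- p / sum(p); q <- q / sum(q)
  m <- (p + q) / 2                         # midpoint distribution
  kl <- function(a, b) sum(ifelse(a > 0, a * log(a / b), 0))
  (kl(p, m) + kl(q, m)) / 2                # average of the two KL terms
}

jsd(c(0.5, 0.5), c(0.5, 0.5))  # identical distributions -> 0
jsd(c(1, 0), c(0, 1))          # disjoint supports -> log(2)
```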
The Wasserstein distance provides a way to measure the cost of transforming one probability distribution into another. It can be viewed as the minimum “effort” required to transport all the mass of one distribution to match the other, where the cost depends on both the amount of mass moved and the distance it travels. Unlike divergence-based measures (e.g. KL or JS divergence), Wasserstein distance naturally accounts for the geometry of the sample space, making it particularly useful when the shapes or locations of the distributions differ.
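For equal-sized samples in one dimension, the Wasserstein-1 distance reduces to the mean absolute difference between the sorted values; a minimal sketch (again illustrative, not necessarily the estimator used in setup.R):

```r
# Empirical 1-D Wasserstein-1 distance between two equal-sized samples
wasserstein1 <- function(x, y) {
  stopifnot(length(x) == length(y))
  mean(abs(sort(x) - sort(y)))  # optimal transport pairs sorted values
}

wasserstein1(c(1, 2, 3), c(1, 2, 3))  # identical samples -> 0
wasserstein1(c(0, 0, 0), c(1, 1, 1))  # shifted by 1 -> 1
```

Unlike JSD, this value grows with how far the mass must move, which is why it reflects the geometry of the sample space.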
The following paragraph presents visualizations of various imputation techniques applied to different types of datasets to observe their impact. The reconstructed datasets are then evaluated and compared using two distribution-based metrics.
# Define methods with their display names and corresponding functions
imputation_methods <- list(
"Original Dataset" = function(data) { data }, # Custom if statement for this
"Mean Imputation" = function(data) { simple_imputation(data, "mean") },
"Hot Deck" = function(data) { hot_deck_imputation(data) },
"KNN Imputation" = function(data) { kNN(data, imp_var = FALSE) },
"Regression" = function(data) { regression_imputation(data) },
"Regression with Noise" = function(data) { regression_imputation(data, noise = TRUE) },
"GAM" = function(data) { gam_based_imputation(data) },
"GAM with Noise" = function(data) { gam_based_imputation(data, noise = TRUE) },
"Random Forest" = function(data) { tree_based_imputation(data) },
"Random Forest with Noise" = function(data) { tree_based_imputation(data, noise=TRUE) }
)

We begin by testing the imputation techniques on data characterized by a linear relationship and heteroscedasticity. Methods such as regression imputation are expected to perform well in this scenario, as they are designed to capture and model linear patterns within the data.
Before that, we demonstrate how various noise strategies can be incorporated into the imputation methods, as this technique will be applicable in later scenarios as well.
# Generate and create missing data
n <- 200
p <- 0.3
linear_data <- generate_data(n, "linear", x_range = c(0, 1), noise_sd = 0.5,
homoscedasticity = FALSE, min_sd_noise = 0.1)
linear_missing <- delete_MAR_1_to_x(linear_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)
noise_imputation_methods <- list(
"Regression" = function(data) { regression_imputation(data) },
"Regression with Noise" = function(data) { regression_imputation(data, noise = TRUE) }
)
# Use the new combined analysis function
noise_result <- plot_imputations_and_metrics(original_data = linear_data, missing_data = linear_missing, imputation_methods = noise_imputation_methods)
# Access the list of individual plots
noise_plots <- noise_result$imputation_plots
# Arrange and display them as needed
grid.arrange(grobs = noise_plots, ncol = 2)
# Use the new combined analysis function
linear_result <- plot_imputations_and_metrics(original_data = linear_data, missing_data = linear_missing, imputation_methods = imputation_methods)
# Access the list of individual plots
imputation_plots <- linear_result$imputation_plots
# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

The second dataset is constructed with a quadratic relationship and an unbalanced number of data points. For quadratic relationships, we anticipate that more flexible imputation methods, such as Generalized Additive Models (GAM) and Random Forest, will outperform simple linear regression due to their ability to capture complex, non-linear patterns in the data. The relationship is generated via \(x^2 + \epsilon\).
# Generate and create missing data
quad_data <- generate_data(n, "quadratic", x_range = c(-2, 2), noise_sd = 1, balanced = FALSE, alpha = 1, beta = 2)
quad_missing <- delete_MAR_1_to_x(quad_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)
quad_result <- plot_imputations_and_metrics(quad_data, quad_missing, imputation_methods)
# Access the list of individual plots
imputation_plots <- quad_result$imputation_plots
# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

The third dataset is designed with a piecewise relationship, where distinct segments of the data follow different patterns. For this type of distribution, we expect imputation methods capable of handling discontinuities and local variations, such as Random Forest and k-Nearest Neighbors, to perform better than global modeling approaches like linear regression, which may struggle to capture the segmented structure of the data. To avoid missingness concentrating in only one of the pieces, we lowered the odds to 3 instead of the usual 10.
The dependent variable \(x_2\) is generated using the following piecewise relationship:
\[ x_2 = \begin{cases} 2 + 2x_1 + \epsilon, & \text{if } x_1 < \text{pivot} \\ 15 + 5x_1 + \epsilon, & \text{if } x_1 \geq \text{pivot} \end{cases} \]
Here, \(\text{pivot} = \frac{x_{\text{range}[1]} + x_{\text{range}[2]}}{2}\), and \(\epsilon\) is a random noise term added to introduce variability.
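The piecewise rule above can be reproduced directly; this sketch assumes a hypothetical x_range of c(0, 10) and only illustrates the stated formula, not the generate_data implementation:

```r
set.seed(1)
n <- 200
x_range <- c(0, 10)                    # hypothetical range for illustration
pivot <- (x_range[1] + x_range[2]) / 2 # midpoint cutoff between the two pieces
x1 <- runif(n, x_range[1], x_range[2])
eps <- rnorm(n, sd = 2)                # noise term epsilon
x2 <- ifelse(x1 < pivot, 2 + 2 * x1, 15 + 5 * x1) + eps
```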
# Generate and create missing data
piecewise_data <- generate_data(n, "piecewise", noise_sd = 2)
piecewise_missing <- delete_MAR_1_to_x(piecewise_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 3)
piecewise_result <- plot_imputations_and_metrics(piecewise_data, piecewise_missing, imputation_methods)
# Access the list of individual plots
imputation_plots <- piecewise_result$imputation_plots
# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

The final dataset follows a logarithmic relationship, providing insight into how different imputation methods handle non-linear but monotonic patterns. The dataset is generated by \(3 \log(x) + \epsilon\).
# Generate and create missing data
log_data <- generate_data(n, "log", x_range = c(1, 100), noise_sd = 1)
log_missing <- delete_MAR_1_to_x(log_data, p, cols_mis = "x2", cols_ctrl = "x1", x = 10)
log_result <- plot_imputations_and_metrics(log_data, log_missing, imputation_methods)
# Access the list of individual plots
imputation_plots <- log_result$imputation_plots
# Arrange and display them as needed
grid.arrange(grobs = imputation_plots, ncol = 2)

In this second part of our project, we address the challenge of missing data in real-world datasets. Unlike the previous section, our approach here begins with a complete dataset, specifically a refined version of the well-known Auto MPG dataset. We will systematically introduce missing values into one variable, conditioned on two other variables, to simulate a Missing at Random mechanism. This will be done using functions from the missMethods package in R.
Following the introduction of missing data, we will apply the various imputation techniques, previously discussed, to generate complete datasets. However, instead of simply comparing the distributions of these imputed datasets, our focus will shift to evaluating their performance in the context of linear regression models. Specifically, we will assess the impact of different imputation methods by comparing the estimated coefficients and Mean Squared Errors of the fitted models.
In the final segment of this section, we will extend our analysis by introducing missing values across multiple covariates, conditioned on the remaining variables, resulting in a dataset with a substantial proportion of missing data. To address this, we will employ multiple imputation techniques, particularly those implemented in the MICE package, which utilizes chained equations for imputing multiple variables. Subsequently, we will compare the performance of linear models fitted on these newly imputed datasets.
We have selected the Auto MPG dataset for this analysis due to its simplicity and recognition in the field. Our objective is to fit regression models starting from this dataset and systematically evaluate the effects of various imputation strategies on model performance.
Let’s start by analyzing the data, exploring its features, and examining their relationships.
The “auto-mpg” dataset tracks various technical information about car models manufactured in the 1970s and 1980s. The response variable will be mpg (miles per gallon).
First, we examine the variables and observations in the dataset:
## 'data.frame': 398 obs. of 9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : int 8 8 8 8 8 8 8 8 8 8 ...
## $ displacement: num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : chr "130.0" "165.0" "150.0" "150.0" ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration: num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ model_year : int 70 70 70 70 70 70 70 70 70 70 ...
## $ origin : int 1 1 1 1 1 1 1 1 1 1 ...
## $ car_name : chr "chevrolet chevelle malibu" "buick skylark 320" "plymouth satellite" "amc rebel sst" ...
## mpg cylinders displacement horsepower
## Min. : 9.00 Min. :3.000 Min. : 68.0 Length:398
## 1st Qu.:17.50 1st Qu.:4.000 1st Qu.:104.2 Class :character
## Median :23.00 Median :4.000 Median :148.5 Mode :character
## Mean :23.51 Mean :5.455 Mean :193.4
## 3rd Qu.:29.00 3rd Qu.:8.000 3rd Qu.:262.0
## Max. :46.60 Max. :8.000 Max. :455.0
## weight acceleration model_year origin
## Min. :1613 Min. : 8.00 Min. :70.00 Min. :1.000
## 1st Qu.:2224 1st Qu.:13.82 1st Qu.:73.00 1st Qu.:1.000
## Median :2804 Median :15.50 Median :76.00 Median :1.000
## Mean :2970 Mean :15.57 Mean :76.01 Mean :1.573
## 3rd Qu.:3608 3rd Qu.:17.18 3rd Qu.:79.00 3rd Qu.:2.000
## Max. :5140 Max. :24.80 Max. :82.00 Max. :3.000
## car_name
## Length:398
## Class :character
## Mode :character
##
##
##
There are 9 variables with a total of 398 observations.
mpg: A continuous variable indicating the miles traveled per gallon of fuel.
cylinders: A discrete quantitative variable indicating the number of engine cylinders (to be treated as a factor).
displacement: A continuous variable indicating engine displacement.
horsepower: A character variable indicating the horsepower of the car (to be treated as a continuous quantitative variable).
weight: A continuous variable indicating the car’s weight.
acceleration: A continuous variable indicating the car’s acceleration.
model_year: An integer variable indicating the year the car model was produced.
origin: A categorical variable indicating the continent of the car manufacturer (1: America, 2: Europe, 3: Asia).
car_name: A qualitative variable indicating the manufacturer and model name of the car.
Before building our model, we need to make the data usable. We observe anomalies in the dataset variables, so we clean the data where possible. We also assign the correct data types to variables and create two new variables: cut_model_year and car_brand.
cut_model_year: A categorical variable dividing production years into the classes 70-73, 74-76, 77-79, and 80-82.
car_brand: A categorical variable that tracks only the car’s manufacturer.
## 'data.frame':    391 obs. of  9 variables:
## $ mpg : num 18 15 18 16 17 15 14 14 14 15 ...
## $ cylinders : Factor w/ 5 levels "3","4","5","6",..: 5 5 5 5 5 5 5 5 5 5 ...
## $ displacement : num 307 350 318 304 302 429 454 440 455 390 ...
## $ horsepower : num 130 165 150 150 140 198 220 215 225 190 ...
## $ weight : num 3504 3693 3436 3433 3449 ...
## $ acceleration : num 12 11.5 11 12 10.5 10 9 8.5 10 8.5 ...
## $ origin : Factor w/ 3 levels "1","2","3": 1 1 1 1 1 1 1 1 1 1 ...
## $ cut_model_year: Factor w/ 4 levels "70-73","74-76",..: 1 1 1 1 1 1 1 1 1 1 ...
## $ car_brand : Factor w/ 28 levels "amc","audi","bmw",..: 6 4 20 1 11 11 6 20 21 1 ...
## - attr(*, "na.action")= 'omit' Named int 29
## ..- attr(*, "names")= chr "29"
## mpg cylinders displacement horsepower weight
## Min. :10.00 3: 4 Min. : 68.0 Min. : 46.0 Min. :1613
## 1st Qu.:17.25 4:199 1st Qu.:105.0 1st Qu.: 75.0 1st Qu.:2224
## Median :23.00 5: 3 Median :151.0 Median : 93.0 Median :2800
## Mean :23.48 6: 83 Mean :194.1 Mean :104.2 Mean :2973
## 3rd Qu.:29.00 8:102 3rd Qu.:264.5 3rd Qu.:125.0 3rd Qu.:3611
## Max. :46.60 Max. :455.0 Max. :230.0 Max. :5140
##
## acceleration origin cut_model_year car_brand
## Min. : 8.00 1:244 70-73:123 ford : 49
## 1st Qu.:13.75 2: 68 74-76: 90 chevrolet: 47
## Median :15.50 3: 79 77-79: 93 plymouth : 31
## Mean :15.53 80-82: 85 dodge : 28
## 3rd Qu.:17.00 amc : 27
## Max. :24.80 toyota : 26
## (Other) :183
After these steps, we obtain a new dataset that contains 9 variables and 391 observations:
mpg: Continuous, indicating miles per gallon.
cylinders: Discrete, indicating the number of cylinders, categorized into 5 levels: 3, 4, 5, 6, 8.
displacement: Continuous, indicating engine displacement.
horsepower: Continuous, indicating car horsepower.
weight: Continuous, indicating car weight.
acceleration: Continuous, providing a parameter for the car’s acceleration.
origin: Categorical, indicating the continent of origin (1: America, 2: Europe, 3: Asia).
cut_model_year: Categorical, dividing production years into the classes 70-73, 74-76, 77-79, and 80-82.
car_brand: Categorical, tracking the car’s manufacturer.
Some summary statistics differ from before because observations containing null values were removed.
Let’s start by visualizing the correlation matrix of the continuous variables:
Our analysis reveals both significant positive and negative correlations among the variables. Notably, acceleration exhibits the weakest correlations with other variables, with coefficients ranging from -0.69 to 0.42. In contrast, variables such as weight, horsepower, and displacement demonstrate strong positive intercorrelations. This observation aligns with expectations, as higher displacement typically results in increased horsepower, and heavier vehicles necessitate more powerful engines. Consequently, these more powerful engines tend to consume more fuel.
Scatterplots show clear patterns for the first three variables. The correlation between acceleration and mpg is less obvious because its absolute value is lower.
The histogram of displacement shows a positively skewed distribution with a peak around 100. The distributions of weight and horsepower are also positively skewed, with heavier right tails. The acceleration variable shows an approximately symmetric distribution with mean, mode, and median around 15.
The histogram shows that mpg has a positively skewed distribution with a long right tail. This is more evident when a kernel density curve is added to the histogram, suggesting that transformations may be necessary.
We analyze categorical variables in relation to mpg using boxplots. Because car_brand has a high number of categories, we use a bar plot for it instead.
The number of cylinders slightly influences mpg. We notice that 4-cylinder engines appear to be the most efficient (in fact, mid-to-high range compact cars generally use this type of engine, where low fuel consumption is a key requirement). Conversely, 8-cylinder engines are the least efficient in terms of fuel consumption (as 8-cylinder engines, typically V8s, are characteristic of supercars or American muscle cars).
The country where the car manufacturer is based also seems to impact fuel consumption. American manufacturers tend to produce cars with higher fuel consumption, while Asian manufacturers perform slightly better than European ones.
The year of production shows a positive trend regarding fuel efficiency as the years progress. This could be attributed to technological advancements enabling the production of more fuel-efficient cars or to the oil crises of the 1970s, which may have pushed manufacturers in this direction.
Finally, regarding the car manufacturer, it is difficult to identify very significant patterns. The only notable observation is that some brands, like Nissan, for example, have a much higher average miles-per-gallon compared to brands like Chrysler (noting that the former is an Asian brand, while the latter is American).
The graphs show unimodal curves with differing frequency peaks and averages based on the continent of origin:
This confirms that American cars generally consume more fuel, followed by European cars, with Asian cars being the most fuel-efficient.
The distributions are similar across categories, with high peaks followed by heavier right tails. This suggests that the number of cylinders is not a major factor in explaining mpg. However, cars with more cylinders tend to achieve fewer miles per gallon. Data on 5-cylinder cars is limited, making conclusions difficult.
Production year initially appears to have little influence, but there is a marked change in distribution patterns for cars from the 1980s, particularly compared to those from the early 1970s.
In this subsection, we consider the construction and assessment of a linear regression model utilizing the complete Auto MPG dataset. Initially, we partition the dataset into training and testing subsets, employing an 80-20 split to ensure robust model validation. Subsequently, we fit multiple linear regression models to identify the optimal combination of predictor variables. Our primary model focuses on predicting the natural logarithm of mpg based on variables such as weight, horsepower, and categorized model year. We then evaluate the performance of this model, comparing it against alternative models with different predictor combinations, to determine the most effective predictors for fuel efficiency.
The model assessment shows that the model using weight, horsepower, and cut_model_year as predictors fits the data best.
As anticipated, we generate missing values in the dataset with MAR mechanism. We underline that we decided to remove values on the predictor weight with respect to acceleration and displacement, since its high correlation with other variables and importance role in predicting the target.
cutoff_75th_percentile <- function(x) {quantile(x, 0.75)}
missingMar <- delete_MAR_1_to_x(data_original, p = 0.2, cols_mis = "weight", cols_ctrl = "displacement", x = 10, cutoff_fun = cutoff_75th_percentile)
missingMar <- delete_MAR_censoring(missingMar, p = 0.1, cols_mis = "weight", cols_ctrl = "acceleration")
# plot VIM
aggr_plot <- aggr(missingMar, numbers = TRUE, labels = names(data_original), prop = c(TRUE, FALSE), col = nord_contrast)
# plot 2
missing.rows <- dim(missingMar)[1] - dim(na.omit(missingMar))[1]
sprintf("Missing rows: %s (%s%%)", missing.rows, round((missing.rows * 100) / dim(missingMar)[1], 2))
## [1] "Missing rows: 76 (24.2%)"
missings_df <- data.frame(type=c("missing", "non-missing") ,count = c(missing.rows, dim(na.omit(missingMar))[1]))
ggplot(missings_df, aes(fill=type, y="", x=count)) +
geom_bar(position="stack", stat="identity")+
ggtitle("Missing data frequency") +
xlab("Obs") + ylab("") +
theme(text = element_text(size = 18))+
scale_fill_manual(values = c(nord_colors[2], nord_colors[3]))+
theme_bw()
# Calculate global ranges for both x and y variables
dis_range <- range(c(data_original$displacement, missingMar$displacement), na.rm = TRUE)
weight_range <- range(c(data_original$weight, missingMar$weight), na.rm = TRUE)
accel_range <- range(c(data_original$acceleration, missingMar$acceleration), na.rm = TRUE)
# Add padding to ranges
dis_pad <- diff(dis_range) * 0.05
weight_pad <- diff(weight_range) * 0.05
accel_pad <- diff(accel_range) * 0.05
# First plot: DIS vs. Weight
plot_data1 <- data_original[, c("weight", "displacement")]
plot_data1$missing <- ifelse(is.na(missingMar$weight), "Missing", "Present")
scatter_plot1 <- ggplot(plot_data1, aes(x = weight, y = displacement, color = missing)) +
geom_point(size = 2) + # Increased point size
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black")) + # Corrected color mapping
scale_y_continuous(limits = c(dis_range[1] - dis_pad, dis_range[2] + dis_pad)) +
scale_x_continuous(limits = c(weight_range[1] - weight_pad, weight_range[2] + weight_pad)) +
theme_minimal(base_size = 12) + # Increased base font size
labs(color = "Data Status", x = "Weight", y = "Displacement") +
theme(legend.position = "none")
final_plot1 <- ggMarginal(scatter_plot1,
margins = "y",
groupColour = TRUE,
groupFill = TRUE,
type = "boxplot",
size = 5,
width = 0.2,
outlier.size = 1)
# Second plot: Acceleration vs. Weight
plot_data2 <- data_original[, c("weight", "acceleration")]
plot_data2$missing <- ifelse(is.na(missingMar$weight), "Missing", "Present")
scatter_plot2 <- ggplot(plot_data2, aes(x = weight, y = acceleration, color = missing)) +
geom_point(size = 2) +
scale_color_manual(values = c("Missing" = red_nord, "Present" = "black" )) +
scale_y_continuous(limits = c(accel_range[1] - accel_pad, accel_range[2] + accel_pad)) +
scale_x_continuous(limits = c(weight_range[1] - weight_pad, weight_range[2] + weight_pad)) +
theme_minimal(base_size = 12) +
labs(color = "Data Status", x = "Weight", y = "Acceleration") +
theme(legend.position = "none")
final_plot2 <- ggMarginal(scatter_plot2,
margins = "y",
groupColour = TRUE,
groupFill = TRUE,
type = "boxplot",
size = 5,
width = 0.2,
outlier.size = 1)
# Combine the two plots using gridExtra::grid.arrange()
grid.arrange(final_plot1, final_plot2, ncol = 1)
In this section, we employ a 3D scatter plot to elucidate the relationships among ‘displacement’, ‘acceleration’, and ‘weight’ within our dataset. This visualization not only highlights the interplay between these variables but also distinctly marks the data points where ‘weight’ values are missing.
# Assuming `data_original` is your original dataset and `missingMar` is the modified one
plotMar_data <- data_original %>%
mutate(missing = ifelse(is.na(missingMar$weight), "Missing", "Present")) # Flag missing/present
# Create the 3D plot based on `plotMar_data`
plot_ly(data = plotMar_data,
x = ~displacement,
y = ~acceleration,
z = ~weight, # Use original weight for z-axis
color = ~missing,
colors = c("Present" = "black", "Missing" = red_nord), # Ensure 'red' is defined
type = 'scatter3d',
mode = 'markers',
marker = list(size = 4)) %>% # Adjust point size here
plotly::layout(
title = '3D Plot of Missing vs Present Data',
scene = list(
xaxis = list(title = 'Displacement'),
yaxis = list(title = 'Acceleration'),
zaxis = list(title = 'Weight')
)
)
In this section, we address the challenge of missing data within our dataset by applying various imputation methods. The goal is to assess how different imputation strategies influence the performance and outcomes of linear regression models.
imp_ds_mean <- impute_mean(missingMar)
imp_ds_median <- impute_median(missingMar)
imp_del_list <- listwise_deletion(missingMar)
imp_ds_reg <- regression_imputation(missingMar[, -which(names(missingMar) == "car_brand")], noise = TRUE)
imp_ds_rf <- tree_based_imputation(missingMar[, -which(names(missingMar) == "car_brand")], noise = TRUE)
# sub dataset for Gam model
prova <- missingMar[, !names(missingMar) %in% c("car_brand", "origin", "cylinders", "cut_model_year")]
imp_ds_gam <- gam_based_imputation(prova, noise = TRUE, max_predictors = 4)
imp_ds_gam <- cbind(imp_ds_gam, missingMar[, c("car_brand", "origin", "cylinders", "cut_model_year")])
Post-imputation, we fit a linear regression model to each imputed dataset. The models predict the logarithm of miles per gallon (log(mpg)) using predictors such as weight, horsepower, and cut_model_year. This approach enables us to evaluate and compare the impact of each imputation method on the regression coefficients and overall model performance.
By systematically applying these imputation techniques and fitting corresponding linear models, we aim to identify the most effective strategies for handling missing data in our analysis.
fit_mean<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_mean)
fit_median<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_median)
fit_list<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_del_list)
fit_reg<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_reg)
fit_rf<-lm(I(log(mpg))~weight+horsepower + cut_model_year, data = imp_ds_rf)
fit_gam <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_gam)
In this section, we present a comprehensive summary and analysis of the coefficients derived from linear models fitted to datasets subjected to various imputation techniques. The imputation methods employed include mean imputation, median imputation, listwise deletion, regression imputation, random forest imputation, and GAM-based imputation. For each imputed dataset, a linear model was fitted with the logarithm of miles per gallon as the response variable and weight, horsepower, and cut_model_year as predictor variables.
We begin by summarizing the fitted models, highlighting key statistics such as coefficient estimates, standard errors, and significance levels. Subsequently, we conduct a detailed coefficient analysis, focusing on the estimates and their confidence intervals for each predictor across the different imputation methods. This analysis is further visualized through plots that compare the coefficients and their confidence intervals, providing insights into the variability and stability of the estimates resulting from the different imputation strategies.
By examining these comparisons, we aim to assess the impact of various imputation methods on the model coefficients and to identify which methods yield the most reliable and consistent estimates.
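The helper extract_coefficients is defined in setup.R; conceptually it can be sketched as follows (a sketch only, assuming fit_models is a named list of the fitted lm objects):

```r
# Pull one coefficient and its 95% CI from each fitted model (illustrative)
extract_coefficients_sketch <- function(fits, coef_name) {
  do.call(rbind, lapply(names(fits), function(m) {
    ci <- confint(fits[[m]])[coef_name, ]  # 95% interval by default
    data.frame(Method   = m,
               Estimate = unname(coef(fits[[m]])[coef_name]),
               Lower    = ci[1],
               Upper    = ci[2],
               row.names = NULL)
  }))
}
```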
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_ds_mean)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41008 -0.08189 0.00002 0.09071 0.41170
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.927e+00 3.875e-02 101.342 < 2e-16 ***
## weight -1.555e-04 1.523e-05 -10.215 < 2e-16 ***
## horsepower -4.967e-03 2.607e-04 -19.055 < 2e-16 ***
## cut_model_year74-76 3.946e-02 2.181e-02 1.809 0.0714 .
## cut_model_year77-79 1.718e-01 2.104e-02 8.169 8.14e-15 ***
## cut_model_year80-82 3.015e-01 2.345e-02 12.857 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1326 on 308 degrees of freedom
## Multiple R-squared: 0.8444, Adjusted R-squared: 0.8419
## F-statistic: 334.3 on 5 and 308 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_ds_median)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.41919 -0.08380 -0.00065 0.08807 0.41444
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.913e+00 3.931e-02 99.550 < 2e-16 ***
## weight -1.400e-04 1.471e-05 -9.516 < 2e-16 ***
## horsepower -5.252e-03 2.539e-04 -20.684 < 2e-16 ***
## cut_model_year74-76 3.469e-02 2.214e-02 1.567 0.118
## cut_model_year77-79 1.676e-01 2.136e-02 7.849 7.01e-14 ***
## cut_model_year80-82 2.987e-01 2.384e-02 12.530 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1349 on 308 degrees of freedom
## Multiple R-squared: 0.839, Adjusted R-squared: 0.8364
## F-statistic: 321.1 on 5 and 308 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_del_list)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.34241 -0.05980 -0.00390 0.06574 0.35892
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.934e+00 3.288e-02 119.643 < 2e-16 ***
## weight -2.712e-04 1.947e-05 -13.929 < 2e-16 ***
## horsepower -1.574e-03 4.984e-04 -3.158 0.0018 **
## cut_model_year74-76 8.591e-02 2.117e-02 4.057 6.78e-05 ***
## cut_model_year77-79 1.834e-01 2.013e-02 9.113 < 2e-16 ***
## cut_model_year80-82 2.958e-01 2.090e-02 14.151 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1108 on 232 degrees of freedom
## Multiple R-squared: 0.8512, Adjusted R-squared: 0.848
## F-statistic: 265.5 on 5 and 232 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_ds_reg)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31674 -0.07332 -0.00046 0.07595 0.34464
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.837e+00 2.965e-02 129.392 < 2e-16 ***
## weight -2.626e-04 1.871e-05 -14.030 < 2e-16 ***
## horsepower -7.398e-04 4.496e-04 -1.646 0.100842
## cut_model_year74-76 7.045e-02 2.006e-02 3.512 0.000511 ***
## cut_model_year77-79 1.813e-01 1.900e-02 9.542 < 2e-16 ***
## cut_model_year80-82 3.107e-01 2.121e-02 14.648 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1198 on 308 degrees of freedom
## Multiple R-squared: 0.8729, Adjusted R-squared: 0.8709
## F-statistic: 423.1 on 5 and 308 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_ds_rf)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31490 -0.06511 0.00235 0.07253 0.35447
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.893e+00 2.886e-02 134.863 < 2e-16 ***
## weight -2.776e-04 1.702e-05 -16.305 < 2e-16 ***
## horsepower -8.072e-04 3.902e-04 -2.068 0.039429 *
## cut_model_year74-76 6.501e-02 1.861e-02 3.494 0.000546 ***
## cut_model_year77-79 1.713e-01 1.766e-02 9.700 < 2e-16 ***
## cut_model_year80-82 2.960e-01 1.978e-02 14.965 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1124 on 308 degrees of freedom
## Multiple R-squared: 0.8882, Adjusted R-squared: 0.8864
## F-statistic: 489.4 on 5 and 308 DF, p-value: < 2.2e-16
##
## Call:
## lm(formula = I(log(mpg)) ~ weight + horsepower + cut_model_year,
## data = imp_ds_gam)
##
## Residuals:
## Min 1Q Median 3Q Max
## -0.31702 -0.06743 0.00179 0.07344 0.35273
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 3.879e+00 2.892e-02 134.116 < 2e-16 ***
## weight -2.708e-04 1.707e-05 -15.871 < 2e-16 ***
## horsepower -9.055e-04 3.941e-04 -2.298 0.022255 *
## cut_model_year74-76 6.705e-02 1.889e-02 3.550 0.000445 ***
## cut_model_year77-79 1.774e-01 1.795e-02 9.886 < 2e-16 ***
## cut_model_year80-82 2.999e-01 2.005e-02 14.960 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.1138 on 308 degrees of freedom
## Multiple R-squared: 0.8854, Adjusted R-squared: 0.8836
## F-statistic: 476 on 5 and 308 DF, p-value: < 2.2e-16
set_ds <- list(imp_ds_mean, imp_ds_median, imp_del_list, imp_ds_reg, imp_ds_rf, imp_ds_gam)
set_fit <- list(fit_mean, fit_median, fit_list, fit_reg, fit_rf, fit_gam)
plots_list <- vector("list", length(set_ds)) # pre-allocate
for (i in seq_along(set_ds)) {
plots_list[[i]] <- my_diagnostic_plots(set_fit[[i]], set_ds[[i]], blue_nord)
}
Residual analysis mean
Residual analysis median
Residual analysis deletion
Residual analysis regression
Residual analysis random forest
Residual analysis gam
# Selected coefficient example: Intercept
# `fit_models` is assumed to be a named list of the fitted models above
intercept_df <- extract_coefficients(fit_models, coef_names = "(Intercept)")
intercept_df
# Plotting Intercept
plot_coefficients(intercept_df, "Comparison of Intercept Coefficients with Confidence Intervals", y_limits = c(3.65, 4.05))
# Selected coefficient example: Weight
weight_df <- extract_coefficients(fit_models, coef_names = "weight")
weight_df
# Plotting Weight
plot_coefficients(weight_df, "Comparison of Weight Coefficients with Confidence Intervals")
# Selected coefficient example: Horsepower
horsepower_df <- extract_coefficients(fit_models, coef_names = "horsepower")
horsepower_df
# Plotting Horsepower
plot_coefficients(horsepower_df, "Comparison of Horsepower Coefficients with Confidence Intervals")
# Selected coefficients: multiple terms (e.g., cut_model_year)
cut_model_year_df <- extract_coefficients(fit_models, coef_names = c("cut_model_year74-76", "cut_model_year77-79", "cut_model_year80-82"))
# Plotting Multiple Terms
plot_coefficients(cut_model_year_df, "Comparison of Cut Model Year Coefficients with Confidence Intervals")
run_imputation_and_modeling <- function(train_data, test_data, missingMar) {
missingComplex <- missingMar
# ----------------------------------------------------------------
# 2. Imputation strategies on missingComplex (or missingMar)
# (Choose which you actually want to use. Below is an example.)
# ----------------------------------------------------------------
imp_ds_mean <- impute_mean(missingComplex)
imp_ds_median <- impute_median(missingComplex)
imp_del_list <- listwise_deletion(missingComplex)
imp_ds_reg <- regression_imputation(
missingComplex[, -which(names(missingComplex) == "car_brand")],
noise = TRUE
)
imp_ds_rf <- tree_based_imputation(
missingComplex[, -which(names(missingComplex) == "car_brand")],
noise = TRUE
)
# For GAM imputation, remove non-numeric or unneeded columns:
prova <- missingComplex[, !names(missingComplex) %in% c("car_brand","origin","cylinders","cut_model_year")]
imp_ds_gam <- gam_based_imputation(prova, noise = TRUE, max_predictors = 4)
# Add back the dropped columns, if needed
imp_ds_gam <- cbind(imp_ds_gam, missingComplex[, c("car_brand","origin","cylinders","cut_model_year")])
# ----------------------------------------------------------------
# 3. Fit models on each imputed dataset
# ----------------------------------------------------------------
fit_mean <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_mean)
fit_median <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_median)
fit_list <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_del_list)
fit_reg <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_reg)
fit_rf <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_rf)
fit_gam <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = imp_ds_gam)
# (Optional) Fit an "original" model on the *training* data itself
fit_original <- lm(I(log(mpg)) ~ weight + horsepower + cut_model_year, data = train_data)
# ----------------------------------------------------------------
# 4. Organize models for easy iteration
# ----------------------------------------------------------------
models <- list(
"Listwise deletion" = fit_list,
"Mean Imputation" = fit_mean,
"Median Imputation" = fit_median,
"Linear regression (with noise)" = fit_reg,
"Random Forest" = fit_rf,
"GAM Imputation" = fit_gam,
"Original Dataset (Train Fit)" = fit_original
)
# ----------------------------------------------------------------
# 5. Predict & calculate RMSE on the *test set*
# ----------------------------------------------------------------
calculate_rmse <- function(model, new_data, true_values) {
preds <- predict(model, newdata = new_data)
rmse(preds, true_values)
}
results <- lapply(names(models), function(method) {
data.frame(
Method = method,
RMSE = calculate_rmse(models[[method]], test_data, test_data$mpg)
)
})
# Return a single data frame with columns: Method, RMSE
do.call(rbind, results)
}
data_original <- auto_mpg
K <- 5
folds <- createFolds(1:nrow(data_original), k = K, list = TRUE)
methods <- c("Listwise deletion",
"Mean Imputation",
"Median Imputation",
"Linear regression (with noise)",
"Random Forest",
"GAM Imputation",
"Original Dataset (Train Fit)")
# store RMSE for each fold in a matrix
rmse_results <- matrix(NA, nrow = K, ncol = length(methods),
dimnames = list(paste0("Fold", 1:K), methods))
# We can also store the entire data-frame style results if you prefer
all_results <- vector("list", K)
for (i in seq_along(folds)) {
test_idx <- folds[[i]]
train_idx <- setdiff(seq_len(nrow(data_original)), test_idx)
train_data <- data_original[train_idx, ]
test_data <- data_original[test_idx, ]
test_data$mpg<-log(test_data$mpg)
# 2a. Induce missingness on train_data
missingMar <- delete_MAR_1_to_x(train_data, p = 0.2, cols_mis = "weight", cols_ctrl = "displacement", x = 100)
missingMar<-delete_MAR_censoring(missingMar, p = 0.1, cols_mis = "weight", cols_ctrl = "acceleration")
fold_results <- run_imputation_and_modeling(train_data, test_data, missingMar)
# fold_results is a data frame with columns: Method, RMSE
all_results[[i]] <- data.frame(Fold = i, fold_results)
# Also fill into the rmse_results matrix
for (m in methods) {
rmse_results[i, m] <- fold_results$RMSE[fold_results$Method == m]
}
}
# Create a single data frame of all folds
combined_results <- do.call(rbind, all_results)
# avg per method
aggregate(RMSE ~ Method, data = combined_results, FUN = mean)
The analysis revealed that advanced imputation methods, particularly random forest and GAM, yielded superior model performance, as evidenced by more accurate coefficient estimates and narrower confidence intervals. These methods were followed by regression imputation, while simpler approaches like mean imputation and listwise deletion demonstrated comparatively lower performance. This outcome aligns with expectations, as sophisticated imputation techniques are better equipped to capture complex data patterns, leading to more reliable and robust predictive models.
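Beyond the per-method averages, the spread of RMSE across folds can be inspected with a quick boxplot (a sketch using the combined_results data frame built above; blue_nord comes from setup.R):

```r
# Per-fold RMSE distribution for each imputation method
ggplot(combined_results, aes(x = Method, y = RMSE)) +
  geom_boxplot(fill = blue_nord) +
  coord_flip() +  # horizontal boxes keep long method names readable
  theme_bw() +
  ggtitle("Cross-validated RMSE by imputation method")
```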
missingMar<-delete_MAR_1_to_x(data_original, p = c(0.3,0.3), cols_mis = c("weight", "horsepower"), cols_ctrl =c("displacement", "acceleration"), x =10)
idx_miss <- which(!complete.cases(missingMar))
df_no_miss <- missingMar[complete.cases(missingMar), ]
tmp<-delete_MAR_1_to_x(df_no_miss, p = c(0.3), cols_mis = c("displacement"), cols_ctrl =c("horsepower"), x =100)
missingComplex <- missingMar
missingComplex[complete.cases(missingMar), ] <- tmp
# Convert categorical variables to factors before imputation
# missingComplex$cut_model_year <- as.factor(missingComplex$cut_model_year)
#plot VIM
aggr_plot <- aggr(missingComplex, numbers = TRUE, labels = names(missingComplex), prop = c(TRUE, FALSE), col = nord_contrast)
missing.rows <- dim(missingComplex)[1] - dim(na.omit(missingComplex))[1]
missings_df <- data.frame(type=c("missing", "non-missing") ,count = c(missing.rows, dim(na.omit(missingComplex))[1]))
ggplot(missings_df, aes(fill=type, y="", x=count)) +
geom_bar(position="stack", stat="identity")+
ggtitle("Missing data frequency") +
xlab("Obs") + ylab("") +
theme(text = element_text(size = 18))+
scale_fill_manual(values = c(nord_colors[2], nord_colors[3]))+
theme_bw()
sprintf("Missing rows: %s (%s%%)", missing.rows, round((missing.rows * 100) / dim(missingComplex)[1], 2))
## [1] "Missing rows: 263 (67.26%)"
missingComplex_no_year <- subset(missingComplex, select = -cut_model_year)
# Keep only the numeric columns of missingComplex
no_categorical_dataset <- Filter(is.numeric, missingComplex)
complex_ds_mean <- impute_mean(missingComplex)
complex_ds_del<-listwise_deletion(missingComplex)
# Original model
fit_original_bad <- lm(I(log(mpg))~weight+horsepower,data = auto_mpg)
# Custom models
fit_complex_mean<-lm(I(log(mpg))~weight+horsepower, data = complex_ds_mean)
fit_complex_del <- lm(I(log(mpg)) ~ weight + horsepower, data = complex_ds_del)
set_ds <- list(complex_ds_del, complex_ds_mean)
set_fit <- list(fit_complex_del, fit_complex_mean)
plots_list <- vector("list", length(set_ds)) # pre-allocate
for (i in seq_along(set_ds)) {
plots_list[[i]] <- my_diagnostic_plots(set_fit[[i]], set_ds[[i]], blue_nord)
}
# Now `plots_list` is a list of patchwork objects. You can print them one at a time:
plots_list[[1]]
# Define models and corresponding prediction variables
models <- list(
"Listwise deletion" = fit_complex_del,
"Mean Imputation" = fit_complex_mean,
"Linear regression (with noise)" = final_predictions_reg,
"GAM Imputation" = final_predictions_gam,
"Random Forest" = final_predictions_tree,
"Original Dataset" = fit_original_bad
)
# Function to predict and calculate RMSE
calculate_rmse <- function(model, new_data, true_values, is_final = FALSE) {
# Handle models that already contain final predictions
predictions <- if (is_final) model else predict(model, newdata = new_data)
rmse(predictions, true_values)
}
# Flag for final predictions
final_methods <- c("Linear regression (with noise)", "GAM Imputation", "Random Forest")
# Calculate RMSE for each model
results <- lapply(names(models), function(method) {
rmse_value <- calculate_rmse(
models[[method]],
test_data,
test_data$mpg,
is_final = method %in% final_methods
)
data.frame(Method = method, RMSE = rmse_value)
})
# Combine results into a single data frame
results_df <- do.call(rbind, results)
# Optional: View results in a structured format
results_df
In this project, we analyzed the problem of missing data, focusing on the Missing at Random (MAR) missingness mechanism. We studied several imputation techniques, as discussed in the literature, and assessed their performance through two main approaches. First, we evaluated how well the distribution of the imputed dataset approximated the original one. Second, we examined the impact of different imputation methods on model fitting, specifically how the estimates of coefficients and the Root Mean Square Error (RMSE) were affected by imputation.
Based on our observations, we can draw the following conclusions:
Advanced Imputation Techniques: More sophisticated imputation methods, such as those based on GAMs and random forests (RF), proved to be highly accurate. When the percentage of missing data is relatively small, these methods produce reliable results.
Listwise Deletion with Simple Models: When handling missing values by listwise deletion, the results remained relatively robust, particularly when fitting less flexible models such as linear regression. These simple models did not perform poorly even under listwise deletion.
Simple Imputation and Imputation by Regression: Simple imputation (mean/median) usually gave the worst results, proving to be a weak technique. Imputation by regression, at least in this case study, worked well enough, and when the data were linearly related it performed as well as GAM.
Categorical Variables: Our analysis focused on continuous variables. Future work could extend this analysis to categorical variables, exploring the impact of removing values from different factors or the interaction between categorical variables and missing data in imputation.
Bayesian Approaches: This project utilized classical statistical methods with a frequentist approach. Incorporating Bayesian methods, as demonstrated by the MICE (Multivariate Imputation by Chained Equations) algorithm, could provide an interesting avenue for future work. Many studies suggest that Bayesian approaches may offer optimal solutions for imputing missing values, and integrating these methods into our analysis could enhance the results.
Missing Not at Random (MNAR): We did not address the scenario of missing data that is Missing Not at Random (MNAR), which is often the most challenging case. In such cases, missingness depends on unobserved data, rendering common imputation techniques insufficient. This is an important topic to explore, and we propose future developments in this area to improve the handling of MNAR data.
Noise: We tried different methods of adding noise; this approach appears to be both “simple” (as supported by the literature) and effective. A drawback, however, is that it may not adequately account for the heteroscedasticity of the dataset.
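For reference, the noise-injection idea can be sketched as stochastic regression imputation: instead of filling in the bare fitted value, we add a draw from the estimated residual distribution. The function and predictor names below are illustrative (the actual helpers live in setup.R):

```r
# Stochastic regression imputation for `weight` (illustrative sketch)
impute_weight_with_noise <- function(df) {
  fit  <- lm(weight ~ displacement + horsepower, data = df)  # lm drops NA rows
  miss <- is.na(df$weight)
  mu    <- predict(fit, newdata = df[miss, ])
  sigma <- summary(fit)$sigma  # residual standard deviation
  # constant-variance noise: this is the step that ignores heteroscedasticity
  df$weight[miss] <- mu + rnorm(sum(miss), mean = 0, sd = sigma)
  df
}
```

Because sigma is constant, imputed values inherit homoscedastic noise even where the true residual spread varies, which is the drawback noted above.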